InstaPrism: an R package for fast implementation of BayesPrism

Abstract

Summary: Computational cell-type deconvolution is an important analytic technique for modeling the compositional heterogeneity of bulk gene expression data. A conceptually new Bayesian approach to this problem, BayesPrism, has recently been proposed and has subsequently been shown by independent studies to be superior in accuracy and robustness against model misspecifications; however, given that BayesPrism relies on Gibbs sampling, it is orders of magnitude more computationally expensive than standard approaches. Here, we introduce the InstaPrism package, which re-implements BayesPrism in a derandomized framework by replacing the time-consuming Gibbs sampling step with a fixed-point algorithm. We demonstrate that the new algorithm is effectively equivalent to BayesPrism while providing a considerable speed and memory advantage. Furthermore, the InstaPrism package is equipped with a precompiled, curated set of references tailored for a variety of cancer types, streamlining the deconvolution process.

Availability and implementation: The package InstaPrism is freely available at https://github.com/humengying0907/InstaPrism. The source code and evaluation pipeline used in this paper can be found at https://github.com/humengying0907/InstaPrismSourceCode.

Adhering to the conceptual framework established by BayesPrism [1], InstaPrism also treats deconvolution as a topic-modeling problem: each read in bulk RNA-seq corresponds to a word, each bulk RNA-seq sample to a document, each cell state to a topic, and each gene to an entry in the vocabulary. Topic modeling is performed at the cell state level to accommodate multiple gene expression subtypes (referred to as "cell states") within a cell type, instead of representing a cell type with a single uniform expression profile. The model directly outputs posterior estimates of fractions (the document-topic distribution) and gene expression (the document-vocabulary distribution) at the cell state level, which are then summed to yield the final posteriors at the cell type level.
When using topic modeling for deconvolution, both methods make the following simplifications relative to the traditional latent Dirichlet allocation (LDA) process [2]: 1) The topic-word distribution is predefined and remains constant throughout the model. This eliminates the need to infer this parameter [3], reducing computational cost and allowing independent inference for each "document" (bulk sample). 2) The Dirichlet prior α of the LDA is non-informative, so the posterior is driven mainly by the likelihood. BayesPrism adopts a fixed, non-informative Dirichlet prior ($\alpha = 10^{-8}$), which has a negligible effect when sampling the document-topic distribution (the cell state fractions $\theta_{S \times 1}$ in this context) from the Dirichlet distribution. In InstaPrism, we adhere to the same idea of a non-informative Dirichlet prior and update θ directly, without information from the Dirichlet prior.
InstaPrism differs from BayesPrism in how topic modeling is performed. While BayesPrism relies on Gibbs sampling for model optimization, InstaPrism directly computes the expected mean without the need for sampling.
To illustrate how the InstaPrism algorithm differs from the BayesPrism algorithm, we consider a single bulk gene expression sample with G genes, $X_{G \times 1}$, and a cell-state reference matrix, $A_{G \times S}$, where S is the total number of cell states. BayesPrism infers two variables: the proportions of cell states in X, $\theta_{S \times 1}$, and the cell-state-specific expression $Z_{G \times S}$. In the InstaPrism algorithm we keep track of another matrix, $B_{G \times S}$, which depends on both the static reference input A and the current proportion estimate θ. Each row of B is the mean of a multinomial distribution for assigning the reads of gene $g \in [1, G]$ to the S different cell states. This is a reformulation of the two-step multinomial model described in BayesPrism. In particular, at each iteration we maintain $\sum_s B_{g,s} = 1$ for all $g \in [1, G]$ by normalizing the rows.
Overall, the parameters of InstaPrism/BayesPrism can be summarized as follows:
• Input: $X_{G \times 1}$: gene expression in a given bulk sample
We use the notation $Z_{g,}$ and $Z_{,s}$ to indicate the genes/rows and cell states/columns of Z, respectively.
Given these variable definitions, we present a detailed comparison of the two algorithms as follows:

Algorithm 1 InstaPrism: Fixed-point algorithm

The InstaPrism algorithm essentially finds θ, the fixed point of the above iterative updates. The fixed point θ has the property that it assigns each gene's read counts (or expression values) according to the renormalized $A'_{g,} \bullet \theta$ and forms the deconvolved matrix $Z_{G \times S}$ that obeys the following equality:

The algorithm can also be seen as a derandomization of the original BayesPrism method, replacing the iterative sampling process for each gene with a single vectorized computation. This means that instead of looping over all the genes to acquire the necessary input for the θ update, the algorithm now achieves this in a single step. Despite this simplification, InstaPrism maintains performance integrity: both algorithms exhibit comparable fraction update trajectories (Figure S1), suggesting that InstaPrism is methodologically equivalent to BayesPrism minus sampling. In InstaPrism, the cell-state-specific expression $Z_{G \times S}$ can be reconstructed by:

With the posterior estimates of fractions $\theta_{S \times 1}$ and gene expression $Z_{G \times S}$ at the cell state level, both methods then calculate the final deconvolution estimates at the cell type level as follows:

We note that the comparison above focuses mainly on the algorithmic aspect. In practice, another notable difference is the inclusion of the "reference update" step in the deconvolution process. While BayesPrism by default performs two rounds of topic modeling, initially with an scRNA-based reference and subsequently with an updated reference, InstaPrism by default provides deconvolution results using only the scRNA-based reference.
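The fixed-point iteration described above can be pictured with a small pure-Python sketch. It is an illustration written from the prose description, not the package's implementation: the function names, the uniform starting value of θ, and the toy data are all assumptions, and the update rules (row-normalize $A_{g,s}\theta_s$ to get B, set $Z_{g,s} = X_g B_{g,s}$, then take column shares of Z as the new θ) are one reading of the text.

```python
def split_reads(X, A, theta):
    """Return Z (G x S): the reads of each gene split across cell states.

    Row g of B is A[g][s] * theta[s] renormalized to sum to 1 over s,
    and Z[g][s] = X[g] * B[g][s].
    """
    Z = []
    for x_g, a_g in zip(X, A):
        w = [a * t for a, t in zip(a_g, theta)]
        row_sum = sum(w)
        b_g = [v / row_sum for v in w] if row_sum > 0 else [0.0] * len(w)
        Z.append([x_g * b for b in b_g])
    return Z


def fixed_point_sketch(X, A, n_iter=500):
    """Repeat theta -> column shares of Z for a fixed number of iterations."""
    S = len(A[0])
    total = float(sum(X))
    theta = [1.0 / S] * S  # uniform start: an assumption of this sketch
    for _ in range(n_iter):
        Z = split_reads(X, A, theta)
        theta = [sum(row[s] for row in Z) / total for s in range(S)]
    # Recompute Z so that it matches the returned theta.
    return theta, split_reads(X, A, theta)


def to_cell_types(theta, Z, state_to_type):
    """Sum cell-state fractions and expression within each cell type."""
    frac, expr = {}, {}
    for s, k in enumerate(state_to_type):
        frac[k] = frac.get(k, 0.0) + theta[s]
        col = expr.setdefault(k, [0.0] * len(Z))
        for g, row in enumerate(Z):
            col[g] += row[s]
    return frac, expr


# Toy data (hypothetical): 3 genes, 3 cell states; each column of A sums to 1.
A = [[0.7, 0.1, 0.2],
     [0.2, 0.3, 0.6],
     [0.1, 0.6, 0.2]]
X = [420.0, 310.0, 270.0]  # built as 1000 * A applied to fractions (0.5, 0.3, 0.2)
theta, Z = fixed_point_sketch(X, A)
fractions, expression = to_cell_types(theta, Z, ["T cell", "T cell", "malignant"])
```

Because the toy bulk sample is generated exactly from the reference, the iteration should recover the generating fractions; with real data the fixed point is only an estimate.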
The "reference update" feature introduced by BayesPrism aims to use the information shared across bulk samples to refine cell type fraction estimates [1]. This process begins by using posterior estimates from the first round of deconvolution, specifically the cell-type-specific expression $Z_{G \times N \times K}$ (the combined $Z_{G \times K}$ across N samples), to establish a new sample-specific reference profile ψ. The updated reference differs from the single-cell-based reference in the following ways:
1. The reference of malignant cells, denoted $\psi^{mal}_{g \times n}$, is unique to each individual bulk sample. It is defined for the gth gene of the nth sample, where t indicates the malignant cell type:
2. The reference of the non-malignant cells (or environmental cells), denoted $\psi^{env}_{g \times k}$, is shared across all bulk samples. Again, the calculation of $\psi^{env}$ directly uses the deconvolved cell-type-specific expression $Z_{G \times N \times K}$. Specifically, for the gth gene of the kth cell type, InstaPrism updates the reference as follows:
The updated reference no longer preserves information at the cell state level; instead, it only contains information at the cell type level.
This updated reference is then subjected to topic modeling again to yield the updated posterior estimates. In InstaPrism, this functionality is available as an optional feature through the InstaPrism_update() function. This function provides updated posterior fraction estimates that are nearly identical to those from BayesPrism (data not shown), and additionally updates the posterior of gene expression, a feature not available in BayesPrism.
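As a rough illustration of the two kinds of updated reference described above, the sketch below builds a per-sample malignant profile and a pooled profile for every other cell type. The exact update formulas are not reproduced in this text, so the rule used here (take the relevant slice of Z, pool over samples for environmental cell types, and renormalize over genes) is an assumption, as are the function names and the nested-list data layout.

```python
def normalize(v):
    """Scale a non-negative vector so that it sums to 1."""
    total = float(sum(v))
    if total <= 0:
        raise ValueError("profile has no expression")
    return [x / total for x in v]


def update_reference_sketch(Z, malignant):
    """Z[n][k] is the deconvolved gene vector of cell type k in sample n."""
    # Malignant reference: one profile per bulk sample (assumed normalization).
    psi_mal = [normalize(sample[malignant]) for sample in Z]
    # Environmental reference: pool each cell type over all samples.
    psi_env = {}
    for k in Z[0]:
        if k == malignant:
            continue
        pooled = [sum(vals) for vals in zip(*(sample[k] for sample in Z))]
        psi_env[k] = normalize(pooled)
    return psi_mal, psi_env


# Hypothetical deconvolved expression: 2 samples, 2 genes, 2 cell types.
Z = [
    {"malignant": [30.0, 10.0], "T cell": [2.0, 6.0]},
    {"malignant": [5.0, 15.0], "T cell": [4.0, 4.0]},
]
psi_mal, psi_env = update_reference_sketch(Z, "malignant")
```

The sketch makes the structural point of the text visible: `psi_mal` has one column per sample, while `psi_env` has one column per cell type and no cell-state resolution.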
Given that the "reference update" step pools information from the input bulk samples, the quality of the updated reference can be highly input-dependent, and there is as yet no definitive criterion for deciding when a reference update is appropriate. The quality of the updated reference can be expected to depend on factors such as the number of samples and the quality of the input.
Below, we present the empirical results we obtained when comparing deconvolution performance using the initial reference (the InstaPrism built-in reference) and the updated reference, on a set of validation datasets that includes both real bulk samples from the TCGA cohorts and simulated samples (for details on validation dataset construction, see Text S5; for dataset details, see Tables S1 and S3). Although improvements in tumor purity estimation [5,6,7] were observed for the TCGA cohorts with the updated reference (Fig. S2a), the overall benefit remains uncertain because we do not have access to ground truth fractions for other cell types. It is likely that we achieve improved predictions for the malignant population only, at the expense of predictions for other cell types. Additionally, it is likely that the large number of input samples (see Table S3 for details) is a key factor behind the benefit of the updated reference, which is not typical of real-world deconvolution settings, where input sizes are often limited.
In a separate evaluation (Fig. S2b), we tested the effect of reference updates using 12 simulated datasets (Table S1), each comprising 50 simulated samples with ground truth fractions available for all cell types. Indeed, in some datasets, such as the two simulated LUAD datasets, we observed that the reference update benefits predictions for malignant cells only, while decreasing prediction accuracy for other cell types. For 9 of the 12 datasets, the updated reference worsens the overall estimation of the cell-type fractions, whereas in the remaining three datasets it improves the estimates only slightly. Overall, no clear advantage of using a reference update is observed. We also attempted to pool information at the cell state level to maintain the same granularity as the scRNA-based reference (using the cell-state-specific expression $Z_{G \times N \times S}$ for the ψ calculation); however, this approach performed even worse (data not shown). Taken together, given the input-dependent nature of the "reference update" and our empirical test results, we recommend adhering to the scRNA-based reference for optimal performance, and suggest considering the updated reference only for malignant estimation.

Text S2: Practical guidance on reference construction
While independent studies have demonstrated that the Bayesian deconvolution framework of BayesPrism/InstaPrism excels in accuracy and robustness for deconvolution tasks [8,9], it is important to note that the model intrinsically requires a reference hyperparameter, denoted $A_{G \times S}$, that specifies the event probability of the gth gene in the sth cell state.
In InstaPrism, we provide a refPrepare() function to construct such a reference from user-provided scRNA-seq data and cell annotation information. The function calculates the average expression values of single cells within the same cell state and then renormalizes these values such that the sum across genes is 1, yielding the final $A_{G \times S}$ reference matrix. This hyperparameter $A_{G \times S}$ encapsulates extensive prior knowledge of gene covariance structures and requires no subsetting to marker genes. The proper configuration of this hyperparameter is important for model performance. Here we provide several guidelines for configuring it.
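The two operations attributed to refPrepare() above, averaging cells within a cell state and renormalizing over genes, can be sketched in a few lines. This is an illustrative re-creation of the described computation, not the function's source; `build_reference` and its list-based inputs are hypothetical.

```python
def build_reference(counts, cell_states):
    """Build a G x S reference from linear-scale single-cell counts.

    counts: one list of G gene counts per cell.
    cell_states: one cell-state label per cell.
    Returns (states, A), where column j of A is the profile of states[j].
    """
    n_genes = len(counts[0])
    states = sorted(set(cell_states))
    sums = {s: [0.0] * n_genes for s in states}
    n_cells = {s: 0 for s in states}
    for cell, s in zip(counts, cell_states):
        n_cells[s] += 1
        for g in range(n_genes):
            sums[s][g] += cell[g]
    columns = []
    for s in states:
        mean = [v / n_cells[s] for v in sums[s]]  # average over cells
        total = sum(mean)
        if total <= 0:
            raise ValueError("cell state %r has no expression" % s)
        columns.append([v / total for v in mean])  # sum over genes becomes 1
    A = [[col[g] for col in columns] for g in range(n_genes)]
    return states, A


# Hypothetical toy input: 3 cells, 3 genes, 2 cell states.
counts = [[4.0, 0.0, 6.0], [6.0, 0.0, 4.0], [0.0, 10.0, 10.0]]
states, A = build_reference(counts, ["T_1", "T_1", "B_1"])
```

Every column of the resulting matrix sums to 1, which is the property the deconvolution model relies on.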

Collect scRNA data from matched tissue type
To begin with, one needs to collect scRNA-seq data from a matched tissue type, following the general assumption that the constructed reference will capture the major cell types present in the bulk sample [10]. The ideal scRNA-seq datasets should be of high quality, presented on a linear (non-log) scale, include pre-defined cell type annotations, and ideally contain sample-level annotations that specify the biological origin of the single cells. Available scRNA-seq databases for this purpose include, but are not limited to, the 3CA atlas [11], CellxGene [12], and the Single Cell Portal [13].
Specifically, opting for larger and higher-quality datasets helps mitigate the dropout issue [14] commonly encountered in single-cell data. Given that reference construction relies on aggregating gene expression profiles across individual cells, a larger number of cells can reduce the impact of dropout. Well-labeled single-cell datasets make it easy to identify cell types and more detailed cell subtype (referred to as "cell state") information, which is crucial for effective reference construction. Sample-level annotation can additionally support finer-grained cell subclustering when needed.
We acknowledge that with recent advancement in single cell technology, the availability of scRNA-seq datasets has significantly increased.When multiple candidate datasets are available, one can utilize the evaluation pipeline we proposed (https://github.com/humengying0907/InstaPrismSourceCode) to systematically compare between references and select the one that is optimal.

Cell state specification
Cell state labeling information is another key component in $A_{G \times S}$ construction. It implies two layers of information: 1) the cell state of the single cell, and 2) the cell type to which this cell state is affiliated. This information is passed to the refPrepare() function to indicate which single cells to aggregate to construct the $A_{G \times S}$ reference matrix, and is subsequently used to aggregate the posterior estimates of cell states within the same cell type, yielding the final posterior estimates at the cell type level.
In practice, cell state information can be obtained from the original single-cell annotations when subclustering information for the identified cell types is provided. When such information is missing and only cell type annotations are available, we recommend that users apply the get_subcluster() function provided in InstaPrism to obtain fine-grained cell state information within each cell type. This function incorporates two subclustering methods: the "quickCluster" method from the scran package [15] and the "SCISSORS" method [16]. Specifically, the "quickCluster" method offers a significant speed benefit for clustering, while the "SCISSORS" method includes a SilhouetteScores.cutoff parameter to determine which cell types require re-clustering and can automatically determine optimal parameters for subclustering [16], providing a convenient and efficient way to capture the inherent subclustering structure within cell types. Additionally, for heterogeneous populations such as malignant cells, one can assign cell state labels based on their source of origin (using patient identifiers) when other subclustering methods are too computationally intensive. We recommend having at least 20 cells to represent each cell state, in line with the recommendations from BayesPrism.
We acknowledge that cell-type subclustering is an unsupervised problem (see the next section for details), entailing various methodological choices including granularity levels, clustering techniques and clustering hyperparameters. While there is no definitive answer to this selection, the quality of the resulting reference $A_{G \times S}$ can be assessed using the evaluation pipeline we provide, which helps determine the most appropriate methods to use.

Reference evaluation pipeline
To assess the performance of the reference, one can use the reference evaluation pipeline we provide, which includes the following steps:

1). Simulation of bulk samples with known proportions. Bulk samples with known cell type proportions are crucial for evaluating deconvolution performance, as they provide ground truth proportions for comparing results [10]. One common method to generate this type of data is to simulate bulk data by aggregating single cells in predefined proportions, using an scRNA-seq dataset with annotated cell type information [8,17]. These predefined proportions are then used as the ground truth fractions to assess the accuracy of the deconvolution results. The pipeline incorporates the "heterogeneous" simulation strategy we proposed earlier [8], which is designed to reflect realistic biological variance among the simulated samples. Essential inputs for this process include raw scRNA-seq data (on a non-log-transformed scale and sourced from the same tissue as the reference) and cell-type/sample-ID labels of the single cells, which are necessary to guide the aggregation of cells and maintain appropriate biological variance within the simulated samples. The pipeline also supports other user-provided simulated data, provided that these samples are on a non-log-transformed scale and include ground truth cell type fractions. We recommend that users employ scRNA-seq datasets different from the one used to generate the reference, to prevent information transfer from the reference to the bulk samples.

2). Collection of real bulk samples with known proportions. While simulation is a potent tool for reference evaluation, incorporating real bulk samples is beneficial to fully account for application to actual datasets. Our pipeline offers the option to integrate real bulk samples from TCGA cohort data for performance evaluation. Specifically, it applies the build_tcga_obj() function from the deconvBenchmarking [8] package, which downloads the expression data of a given tumor type from the Xena browser [18] and automatically retrieves TCGA purity estimates from various methods [19], including ABSOLUTE [5], ESTIMATE [6], CPE [7] and LUMP [7]. The tumor purity estimates are used as proxies for malignant proportions, which are later compared against the malignant estimates for performance evaluation.

3). Deconvolution & Performance evaluation
Once bulk samples with known ground truth fractions are prepared, we can run deconvolution with the user-provided reference and directly visualize the results (Fig. S3).

Figure S3: An example of the performance visualization plot. A representative scatter plot returned by the evaluation pipeline showing the performance of the built-in CRC reference (CRC_refPhi) on simulated CRC samples. Two sets of plots are provided: "initial", using the scRNA-based reference, and "updated", using the updated reference. For each cell type in the simulated bulk samples, the matched reference cell type is denoted by "maxCorName", and the performance metrics, including root mean square error (RMSE) and Pearson correlation (cor), are provided.
The evaluation pipeline allows simultaneous assessment of multiple references and includes the option to evaluate the "updated reference" (see Text S1 for details) as well. Considering the possible discrepancies between cell type names in the ground truth and the reference, the pipeline is designed to automatically identify pairs of matched truth-reference cell types based on the maximum Pearson correlation. The deconvolution performance, evaluated using Pearson correlation coefficients and RMSE (root mean square error) values between the ground-truth and estimated fractions, is then documented and presented in a scatter plot (Fig. S3) for direct visualization.
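The matching and scoring logic described above (pair each ground-truth cell type with the estimated cell type of maximum Pearson correlation, then report correlation and RMSE) can be sketched as follows. This is a minimal illustration, not the pipeline's code; the function names, the dictionary layout, and the handling of constant vectors are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined for constant input; treated as 0 in this sketch
    return sxy / math.sqrt(sxx * syy)


def rmse(x, y):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def evaluate(truth, estimate):
    """Match each true cell type to its best-correlated estimate and score it.

    truth, estimate: dict mapping cell-type name -> fractions across samples.
    """
    report = {}
    for name, true_vals in truth.items():
        best = max(estimate, key=lambda e: pearson(true_vals, estimate[e]))
        report[name] = {
            "maxCorName": best,
            "cor": pearson(true_vals, estimate[best]),
            "RMSE": rmse(true_vals, estimate[best]),
        }
    return report


# Hypothetical fractions for three samples, with mismatched cell-type names.
truth = {"T cells": [0.10, 0.20, 0.30], "Malignant": [0.60, 0.50, 0.40]}
estimate = {"T_cell": [0.12, 0.19, 0.33], "tumor": [0.58, 0.52, 0.37]}
report = evaluate(truth, estimate)
```

Matching by correlation rather than by name is what lets differently labeled annotations be compared, at the cost of possible mismatches when two estimated cell types co-vary strongly.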
The pipeline is readily available at https://github.com/humengying0907/InstaPrismSourceCode, providing a convenient and user-friendly way to evaluate references.
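Step 1) of the pipeline, simulating bulk samples with known proportions, can be illustrated with the simplest aggregation scheme: draw cells of each type in predefined proportions and sum their counts. The sketch below is not the "heterogeneous" strategy of [8], which additionally models biological variance between samples; `simulate_bulk`, its arguments, and the toy data are hypothetical.

```python
import random


def simulate_bulk(counts, cell_types, proportions, n_cells=500, seed=0):
    """Aggregate single cells into one pseudo-bulk sample.

    counts: one list of gene counts per cell (linear, non-log scale).
    cell_types: one cell-type label per cell.
    proportions: dict cell type -> target fraction of cells.
    Returns (bulk, truth), where truth holds the realized cell fractions.
    """
    rng = random.Random(seed)
    by_type = {}
    for i, k in enumerate(cell_types):
        by_type.setdefault(k, []).append(i)
    bulk = [0.0] * len(counts[0])
    drawn = {}
    for k, p in proportions.items():
        drawn[k] = round(p * n_cells)
        for _ in range(drawn[k]):
            cell = counts[rng.choice(by_type[k])]  # sample with replacement
            for g, v in enumerate(cell):
                bulk[g] += v
    total_cells = sum(drawn.values())
    truth = {k: m / total_cells for k, m in drawn.items()}
    return bulk, truth


# Hypothetical toy data: one cell per type, so the result is deterministic.
counts = [[10.0, 0.0], [0.0, 4.0]]
bulk, truth = simulate_bulk(counts, ["T cell", "malignant"],
                            {"T cell": 0.25, "malignant": 0.75}, n_cells=100)
```

The realized fractions in `truth` serve as the ground truth against which the estimated fractions are scored.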

Recommendations on reference selection
After obtaining the performance data from the reference evaluation pipeline, we recommend that users 1) choose a reference that contains a broad range of cell types; 2) select a reference that excels in Pearson correlation and RMSE values for each cell type; and 3) select a reference that consistently demonstrates good performance across the test datasets.
Again, we acknowledge that reference construction using scRNA-seq data remains an open challenge in the field of deconvolution, and there is as yet no closed-form solution. Even for scRNA-seq datasets from the same tissue type, the resulting references can vary [20]. Various factors affect the consistency between references constructed from different datasets, including biological differences among samples (such as age, ethnicity and spatial heterogeneity), technical differences (such as labs and experimental protocols) and data processing differences (computational pipelines and cell-type annotation methods) [10]. When multiple references are available, one potential technique is to adopt a "consensus" deconvolution approach, which compares results from the various references. Cell type fractions that are consistent across references are considered reliable, while those that vary significantly require extra attention.
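One simple way to operationalize the "consensus" idea is to compare, for each cell type shared by all references, the spread of its estimated fraction across references. The sketch below is only an illustration: the source does not define a consistency criterion, so the range-based rule, the `max_spread` threshold of 0.05, and all names are assumptions.

```python
def consensus(estimates_by_ref, max_spread=0.05):
    """Flag cell types whose estimates agree across references.

    estimates_by_ref: dict reference name -> dict cell type -> fraction
    (for one bulk sample). max_spread is a hypothetical tolerance.
    """
    shared = set.intersection(*(set(e) for e in estimates_by_ref.values()))
    summary = {}
    for k in sorted(shared):
        vals = [e[k] for e in estimates_by_ref.values()]
        spread = max(vals) - min(vals)
        summary[k] = {
            "mean": sum(vals) / len(vals),
            "spread": spread,
            "reliable": spread <= max_spread,
        }
    return summary


# Hypothetical estimates for one sample from three references.
estimates = {
    "ref_A": {"T cell": 0.30, "B cell": 0.10, "NK": 0.02},
    "ref_B": {"T cell": 0.29, "B cell": 0.25},
    "ref_C": {"T cell": 0.31, "B cell": 0.12},
}
summary = consensus(estimates)
```

Cell types absent from any reference (here "NK") are left out of the comparison rather than treated as zero.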
To facilitate the reference construction process, we provide a set of built-in references covering a broad range of cancer types (Table S1), constructed following the guidance above. The built-in references are available through the InstaPrism_reference() function, or can be downloaded directly from https://github.com/humengying0907/InstaPrismSourceCode.

Text S3: Impact of cell state specification on model performance
In the above section, we provided a general guideline for reference construction and introduced the evaluation pipeline to assess a reference's performance. However, in practice, specifying cell states in the reference involves several methodological decisions, such as choosing a cell state clustering method, determining the granularity of cell states, and deciding which cell states to include in the reference. In this section, we explore how different cell state specifications affect model performance and provide guidance on setting this hyperparameter.

Impact of duplicated reference
Consider the following example, where we generated two identical columns for the Oligodendrocyte cell type in the reference matrix $A_{G \times S}$, using the built-in GBM reference (GBM_refPhi) from our package. Analyzing simulated GBM data (Table S1) with this reference, we observed that when duplicated cell states (Oligodendrocyte1 and Oligodendrocyte2) are present, their fraction estimates are identical. Each duplicated cell state accounts for a portion of the ground truth Oligodendrocyte fraction, and the sum of these fraction estimates matches the ground truth fraction (Fig. S4).
This indicates that in the presence of duplicated cell state information, the model adjusts the posterior estimates accordingly. As the model aggregates these estimates across cell states to produce the final cell type level posterior, the fraction estimates at the cell type level stay robust. Experiments using other references also confirmed this observation (data not shown). This highlights the model's ability to handle redundancies in cell state specification. More importantly, it implies that even if very similar cell states are identified during the cell state clustering step, their fraction estimates are essentially additive. In fact, as long as the clustering captures the necessary heterogeneity (see the next subsection for details), the aggregated result remains reliable regardless of the specific clustering approach employed.
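The additive behavior of duplicated cell states can be reproduced on toy data with a fixed-point iteration written from the description in Text S1. This is a hedged sketch, not the package code: the update rule, the uniform starting value, and the toy reference are assumptions, and the state names other than the duplicated Oligodendrocyte columns are hypothetical.

```python
def fixed_point_fractions(X, A, n_iter=1000):
    """Iterate theta -> share of reads assigned to each cell state."""
    S = len(A[0])
    total = float(sum(X))
    theta = [1.0 / S] * S  # uniform start: an assumption of this sketch
    for _ in range(n_iter):
        assigned = [0.0] * S
        for x_g, a_g in zip(X, A):
            w = [a * t for a, t in zip(a_g, theta)]
            row_sum = sum(w)
            if row_sum == 0:
                continue  # gene not expressed by any state with nonzero fraction
            for s in range(S):
                assigned[s] += x_g * w[s] / row_sum
        theta = [v / total for v in assigned]
    return theta


# Toy reference (hypothetical): columns are state_a, state_b, Oligodendrocyte.
A = [[0.7, 0.1, 0.2],
     [0.2, 0.3, 0.6],
     [0.1, 0.6, 0.2]]
X = [420.0, 310.0, 270.0]  # built from fractions (0.5, 0.3, 0.2)

# Duplicate the last column: Oligodendrocyte1 and Oligodendrocyte2.
A_dup = [row + [row[-1]] for row in A]

theta = fixed_point_fractions(X, A)
theta_dup = fixed_point_fractions(X, A_dup)
oligo_total = theta_dup[2] + theta_dup[3]
```

In this toy setting the two duplicated columns receive the same estimate, and their sum should agree with the single-column estimate, mirroring the pattern reported for the GBM example.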

Impact of cell state granularity
Cell state granularity affects the resolution at which cellular differences are discerned and classified. The finer-grained the subclustering, the more cell states are identified and included in the reference. Theoretically, representing a cell type with different cell states captures the diverse molecular characteristics present [21], which is especially valuable for heterogeneous groups like malignant cells. However, determining the optimal level of granularity is an unsupervised problem and does not have a closed-form solution.
With the increasing number of single cells analyzed in single-cell RNA sequencing, many scRNA-seq studies now provide not only cell type annotations but also finer-grained cell state annotations that reveal expression differences within cell types [22]. Using cell state information from these original annotations is a convenient way to specify the cell state granularity level.
Theoretically, other granularity levels exist as well. To explore how varying levels of granularity affect the performance of cell type deconvolution, we considered four levels of granularity:
• granularity_v0: all cell types are represented as a single profile without further subclustering
• granularity_v1: only the malignant populations are subjected to subclustering
• granularity_v2: the granularity level provided by the scRNA studies, with major cell types being subclustered and labeled according to their transcriptional profiles (the same level as in the InstaPrism built-in reference)
• granularity_v3: cell types are reclustered using the get_subcluster() function from the InstaPrism package, with a finer-grained granularity level than granularity_v2

Building references with these granularity levels and applying them to simulated data (Table S1), we observed that i) specifying no subclustering information (granularity_v0), or including subclustering information in malignant cells only (granularity_v1), generally results in poorer performance; ii) adding finer-grained information (granularity_v3) on top of the granularity level provided by the scRNA studies does not yield additional benefits for deconvolution performance; and iii) the subclustering method is not the dominant factor in performance, as cell states specified with different methods (granularity_v2 and granularity_v3) exhibit similar performance. This indicates that with subclustering information specified in both malignant and non-malignant cells (as done in granularity_v2), transcriptional differences within cell types are properly maintained, which is beneficial for accurate cell type fraction prediction. The results also align with our previous finding that a duplicated cell state has a completely additive effect, as adding more cell states (granularity_v3) does not notably alter the deconvolution results.
We note that here we do not explicitly recommend a subclustering method or an exact number of cell states. The general principle is to use any subclustering method, as long as it can discern cellular differences across both malignant and non-malignant cells. For tumor types with known molecular subtypes, such as breast cancer, which includes HER2-enriched, Luminal A, Luminal B, and Basal-like subtypes [23], the malignant cell states should encompass these major subtypes. For non-malignant cell types such as lymphoid and myeloid cells, subclustering is highly recommended to capture potential molecular differences and enhance deconvolution performance. When a comparison of granularity levels is needed, one can refer to our evaluation pipeline for direct assessment.

Impact of missing cell types
Another challenge in reference construction is the "missing cell type" problem, where certain cell types present in the bulk samples are absent from the reference. This issue typically occurs with rare or low-frequency cell types that are not adequately recognized and documented in the scRNA-seq datasets used for reference construction [24]. The direct consequence is obvious: there will be no fraction estimate for the missing cell types. To explore the impact on the prediction of the remaining cell types, we consider the following example. Using the InstaPrism built-in LUAD reference and a simulated lung cancer dataset (Table S1), we gradually removed one cell type at a time from the reference, starting with the least abundant cell types in the simulated bulk. The process continued until the reference contained only four major cell types: malignant cells, macrophages, T cells and B cells. Note that during this process, the cell types remaining in the reference retained the same cell state information. In Fig. S6, we observed that despite this reduction, the fraction predictions for the remaining cell types remain robust. Even with the last reference, the Pearson correlation for the cell types was still around 0.92. Experiments using other references also confirmed this observation (data not shown).

We note that the above testing focuses exclusively on the InstaPrism built-in reference, which primarily targets the tumor microenvironment. In addition, when cell types are removed from the reference, the remaining cell types retain the same cell state information. In a separate testing environment where these conditions are not met, different results may be observed. For example, in the study by Ivich et al. [24], which tested the effect of missing cell types in simulated PBMC (peripheral blood mononuclear cell) samples using a reference constructed from the PBMC3k [25] dataset, worse deconvolution results were observed for the remaining cell types. This could be because the reference they used was not properly optimized and did not include sufficient cell state information, or because PBMCs from healthy donors are intrinsically less diverse than cells in the tumor microenvironment [26], making them more vulnerable to the missing cell-type problem.
Overall, our empirical tests indicate that missing cell types do not significantly hinder predictions for the remaining cell types.However, we still recommend using scRNA-seq data that encompasses a broad range of cell types for reference construction to achieve a more accurate representation of the tissue microenvironment.

Selection of cell types in the reference
The advancement of single-cell technology has helped uncover the complex tissue microenvironment, including rare cell types [27] and distinct cell subsets [22]. Using these finely annotated scRNA datasets for reference construction presents a new challenge: whether to include all identified cell types. Typically, deconvolution performance for rare cell types is difficult to assess experimentally using simulated data, as the scRNA data used for bulk simulations may not include these new cell types or might fail to identify them. For this issue, we consider both including and excluding rare cell types to be feasible options. The rationale for inclusion is that it could enhance the representation of the tissue microenvironment, whereas the rationale for exclusion is that removing these cell types does not affect the prediction for other cell types, as we demonstrated in the previous subsection, and that rare cell types typically exhibit a low frequency (less than 0.1%) in deconvolution results (Fig. S7), suggesting that they may not present meaningful compositional differences. In the InstaPrism built-in references, we chose to include all cell types annotated in the scRNA data to ensure a comprehensive representation of the tumor microenvironment (Table S2).

Note that with the flexibility of reference construction in InstaPrism, users can customize these references by removing or modifying cell types as needed. For example, a rare cell type that consistently exhibits low frequency (less than 0.1%) in the provided bulk data and in external bulk datasets such as TCGA can be excluded from the reference. Additionally, if one is interested in estimating the overall fraction of myeloid populations, one can merge the cell states associated with all myeloid cells, such as macrophages, monocytes, and dendritic cells, into a new cell type in the reference to obtain the overall fraction estimate. Alternatively, if one is interested in a specific population, such as CD4 T cells, one can isolate the cell states associated with CD4 T cells and categorize them as a separate cell type within the reference to obtain the associated fraction estimates. Together, this flexibility enhances the applicability of the InstaPrism built-in references and generalizes to other user-constructed references.
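The two customizations described above, merging cell states under a new cell type and dropping a cell type from the reference, amount to editing the state-to-type labels and the columns of the reference matrix. The sketch below illustrates this on plain lists; it does not use or imitate the package's reference objects, and both helper functions and the toy values are hypothetical.

```python
def merge_cell_types(state_to_type, new_type, members):
    """Relabel every cell state whose type is in `members` as `new_type`."""
    return [new_type if k in members else k for k in state_to_type]


def drop_cell_types(A, state_to_type, drop):
    """Remove all columns (cell states) that belong to the dropped cell types."""
    keep = [s for s, k in enumerate(state_to_type) if k not in drop]
    A_new = [[row[s] for s in keep] for row in A]
    return A_new, [state_to_type[s] for s in keep]


# Toy reference (hypothetical): 2 genes x 6 cell states, columns sum to 1.
A = [[0.5, 0.4, 0.3, 0.2, 0.6, 0.9],
     [0.5, 0.6, 0.7, 0.8, 0.4, 0.1]]
state_to_type = ["macrophage", "macrophage", "monocyte",
                 "dendritic", "CD4 T", "malignant"]

# Report one overall myeloid fraction instead of three separate ones.
merged = merge_cell_types(state_to_type, "myeloid",
                          {"macrophage", "monocyte", "dendritic"})

# Exclude a cell type (here, dendritic cells) from the reference.
A_small, types_small = drop_cell_types(A, state_to_type, {"dendritic"})
```

Merging changes only how cell-state estimates are summed, while dropping changes the model itself; since each column is normalized independently, the remaining columns need no renormalization.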

Text S4: Model convergence
Similar to BayesPrism, InstaPrism essentially optimizes the log likelihood of the underlying topic model. Because the InstaPrism algorithm removes the Dirichlet prior α from the model, the likelihood is simpler than the original likelihood derived in BayesPrism. Specifically, consider the following notation:
• Prior: $A_{G \times S}$: a gene by cell-state reference matrix, with $\sum_g A_{g,s} = 1$ for every cell state $s \in [1, S]$
• Prior: $X_{G \times N}$: input bulk samples for deconvolution analysis, with genes in rows and samples in columns
The likelihood of the model can now be expressed as:

We note that the log likelihood function is directly influenced by the model's complexity, including factors such as the number of bulk samples, the number of genes and the number of cell states in the reference. Over iterations, the model optimizes the posterior estimates until the log likelihood no longer improves. This is an intrinsic feature of topic modeling and does not depend on hyperparameter choices in the model.
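For orientation, a log likelihood of the kind discussed above can be computed as in the sketch below. The expression itself is not reproduced in this text, so the form used here, the multinomial mixture log likelihood $\sum_n \sum_g X_{g,n} \log \sum_s A_{g,s}\theta_{s,n}$ up to an additive constant, is an assumption based on the standard topic-model formulation without a Dirichlet prior; the function name and toy values are hypothetical.

```python
import math


def log_likelihood(X, A, theta):
    """Assumed multinomial log likelihood, up to an additive constant.

    X: G x N bulk expression, A: G x S reference, theta: S x N fractions.
    """
    n_genes, n_states, n_samples = len(A), len(A[0]), len(X[0])
    ll = 0.0
    for n in range(n_samples):
        for g in range(n_genes):
            if X[g][n] == 0:
                continue  # genes with no reads contribute nothing
            mix = sum(A[g][s] * theta[s][n] for s in range(n_states))
            if mix <= 0:
                return float("-inf")  # observed reads the model cannot explain
            ll += X[g][n] * math.log(mix)
    return ll


# Hypothetical toy example: 2 genes, 2 cell states, 1 bulk sample.
A = [[0.7, 0.1],
     [0.3, 0.9]]
X = [[4.0], [6.0]]
ll_good = log_likelihood(X, A, [[0.5], [0.5]])  # mixture matches X / sum(X)
ll_poor = log_likelihood(X, A, [[0.9], [0.1]])
```

Because the sum runs over all genes and samples, the value scales with the size of the input, which is consistent with the remark that likelihoods are not directly comparable across models of different complexity.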
To illustrate this, we plotted the log likelihood changes over iterations using a series of eight models on a simulated breast cancer dataset (Fig. S8). Model 1 utilizes the complete built-in BRCA_refPhi, while each subsequent model (Models 2 to 8) omits one cell type from the BRCA_refPhi at a time. Omitting a cell type means excluding all the associated cell states of that type. This process results in progressively reduced model complexity for Models 2 through 8. Although these models exhibit an increasing "missing cell type" problem, all models eventually converge, with simpler models tending to converge earlier. This suggests that the log likelihood is suitable only for monitoring convergence within the same model and cannot be used for direct comparisons between different models, as it is mainly impacted by the model's complexity. We note that calculating the log likelihood over iterations is computationally intensive; therefore, we have included this only as a supplementary feature in our source code for interested users [28].
In the main deconvolution function InstaPrism(), we implemented a simpler approach for convergence control. This involves calculating the absolute difference ("abs-difference") in fraction estimates between the penultimate and final iterations as an indicator of convergence. A high "abs-difference" value suggests that the model is still updating the fraction estimates and the log likelihood has not yet converged, whereas a small "abs-difference" value indicates that the model has likely reached convergence. Specifically, if the abs-difference for any cell type in any sample exceeds 0.01, the function will issue a warning and prompt the user to select a higher number of iterations. We have also incorporated an "abs-difference" visualization feature (enabled by setting convergence.plot = TRUE) that aids in monitoring the model's convergence status. Taking an example from the InstaPrism tutorial (https://humengying0907.github.io/InstaPrism_tutorial.html): when the number of iterations is insufficient (as shown in Fig. S9a), the convergence checking heatmap will display high values of "abs-difference," prompting a warning to consider increasing the number of iterations. Conversely, with a higher number of iterations (as in Fig. S9b), the "abs-difference" becomes negligible across samples, indicating that the model no longer updates the fraction estimates and has thus achieved convergence.
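The "abs-difference" criterion described above can be sketched as follows. This is a minimal Python illustration of the rule as stated (a warning condition when any cell type in any sample changes by more than 0.01 between the last two iterations); it is not the InstaPrism() implementation, and the function name and toy fraction matrices are hypothetical.

```python
def check_convergence(theta_prev, theta_last, threshold=0.01):
    """Compare fraction estimates from the penultimate and final iterations.

    theta_prev, theta_last: samples x cell types, as nested lists.
    Returns (converged, max_abs_difference).
    """
    max_diff = max(
        abs(prev - last)
        for row_prev, row_last in zip(theta_prev, theta_last)
        for prev, last in zip(row_prev, row_last)
    )
    return max_diff <= threshold, max_diff


# Hypothetical fraction estimates for two samples and three cell types.
theta_prev = [[0.50, 0.30, 0.20], [0.10, 0.60, 0.30]]
theta_few_iter = [[0.55, 0.27, 0.18], [0.10, 0.60, 0.30]]       # still moving
theta_many_iter = [[0.501, 0.299, 0.200], [0.10, 0.60, 0.30]]   # nearly static

for label, theta_last in [("few", theta_few_iter), ("many", theta_many_iter)]:
    converged, max_diff = check_convergence(theta_prev, theta_last)
    if not converged:
        print(f"{label}: max abs-difference {max_diff:.3f}; consider more iterations")
    else:
        print(f"{label}: max abs-difference {max_diff:.3f}; likely converged")
```

Checking only the last two iterations is far cheaper than evaluating the log likelihood at every iteration, which is the trade-off motivating this criterion.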

Text S5: Construction & Validation of InstaPrism Built-in reference
To streamline the deconvolution process, we have provided a set of built-in references covering a broad range of cancer types, adhering to the guidelines for reference construction we proposed (see Text S2 for details). Specifically, we selected seven recent, large-scale scRNA studies that provide extensive annotations at both the cell type and cell state levels (Table S1). The original cell type and cell state annotations are used to guide the cell state specification step during reference construction. For malignant cell types lacking cell state annotations, we either reclustered the cell types using the quick_cluster() function or relied on their source of origin (patient identifiers) to specify the cell states. Ribosomal and mitochondrial genes are removed from the reference following guidance from BayesPrism. All references are generated using the refPrepare() function in InstaPrism. The details of the cell type and cell state information in each built-in reference are listed in Table S2.

Validation using simulated data
To validate the performance of these built-in references, we applied the evaluation pipeline we proposed earlier (see Text S2.3 for details). Specifically, we simulated bulk samples using scRNA-seq datasets that are different from the one used to generate the scRNA-based references. This approach ensures robust reference validation, as no information from the reference is used to generate the simulated bulk samples, mirroring real-world deconvolution scenarios when there are inevitable technical differences between the reference data and the bulk samples being analyzed [10]. Details of the validation scRNA datasets used for bulk simulation can be found in Table S1.
Each simulated bulk dataset contains a total of 50 simulated samples, which comprise two parts: 1) pseudo-bulk samples created by aggregating cells from the same biological sample, and 2) heterogeneously simulated bulk samples, created using the heterogeneous bulk simulation strategy we proposed recently [8], a simulation method specifically designed to preserve appropriate sample-level heterogeneity in the simulated samples. In the pseudo-bulk samples, cell type fractions reflect the original distributions within the biological samples, whereas in the heterogeneously simulated samples, fractions are drawn from an asymmetric Dirichlet distribution, with the α parameter determined by cell type frequencies from the source scRNA-seq dataset, and the shape parameter adjusted accordingly to ensure a proper fraction range.
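Drawing cell type fractions from an asymmetric Dirichlet distribution can be sketched as below. This is an illustrative Python example rather than the simulation code of [8]: it assumes that the Dirichlet parameters are the cell type frequencies multiplied by a single `concentration` value (a hypothetical stand-in for the shape adjustment mentioned above), and the frequencies shown are made up.

```python
import random


def simulate_fractions(frequencies, concentration, n_samples, seed=0):
    """Draw cell type fractions from Dirichlet(concentration * frequencies).

    A Dirichlet draw is obtained by normalizing independent Gamma draws.
    """
    rng = random.Random(seed)
    alphas = [concentration * f for f in frequencies]
    samples = []
    for _ in range(n_samples):
        draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
        total = sum(draws)
        samples.append([d / total for d in draws])
    return samples


# Hypothetical cell type frequencies from a source scRNA-seq dataset.
frequencies = [0.6, 0.3, 0.1]
fractions = simulate_fractions(frequencies, concentration=10.0, n_samples=50)
print(fractions[0])
```

A larger `concentration` keeps the simulated fractions close to the source frequencies, while a smaller value spreads them over a wider range across samples.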
Applying the InstaPrism built-in reference to the validation datasets, we first obtained a set of estimated fraction matrices, denoted as $\theta_{50 \times k}$, where each column corresponds to a cell type present in the reference. Given that the simulated bulk samples are generated from a different scRNA-seq dataset than the one used for reference construction, the ground truth fraction matrices, $\theta_{50 \times k'}$, do not necessarily have the same column names or the same number of columns as the estimated fraction matrices. This discrepancy is important for testing the generalizability of the reference. Reference and ground truth cell types are matched by finding the maximum Pearson correlation between the two sets of fraction matrices, and the matches are then manually adjusted to include only cell types present in both sets.
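The automated part of this matching step can be sketched as follows: for each estimated cell type, pick the ground truth cell type whose fractions have the highest Pearson correlation across samples. This Python example is a minimal illustration under that reading; the cell type names and values are hypothetical, and the manual adjustment described above is not automated here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def match_cell_types(estimated, truth):
    """Map each estimated cell type to its best-correlated truth cell type."""
    matches = {}
    for est_name, est_values in estimated.items():
        best = max(truth, key=lambda name: pearson(est_values, truth[name]))
        matches[est_name] = (best, pearson(est_values, truth[best]))
    return matches


# Hypothetical fractions over five simulated samples; names differ by dataset.
truth = {
    "T cell": [0.10, 0.30, 0.20, 0.40, 0.25],
    "Malignant cells": [0.70, 0.40, 0.60, 0.30, 0.35],
    "Fibroblast": [0.20, 0.30, 0.20, 0.30, 0.40],
}
estimated = {
    "Tcell": [0.12, 0.28, 0.22, 0.37, 0.26],
    "malignant": [0.66, 0.43, 0.57, 0.33, 0.36],
    "CAF": [0.22, 0.29, 0.21, 0.30, 0.38],
}

matches = match_cell_types(estimated, truth)
for est_name, (truth_name, r) in matches.items():
    print(f"{est_name} -> {truth_name} (r = {r:.2f})")
```

Because correlated cell types can produce spurious best matches, a manual review of the resulting pairs remains necessary, as noted in the text.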
The reference performance was then assessed using Pearson correlation between the ground truth and estimated fractions for each matched cell type. In Fig. S10, we summarized the Pearson correlation and plotted it against the average fraction of each cell type in the simulated samples. As previously explained, the fractions in these samples are designed to mirror the actual abundance distributions of cell types in the scRNA data used for the bulk simulation. Therefore, the average cell type fraction (x axis) in this context serves as a proxy for the typical cell type abundance in the tumor tissue. A detailed scatter plot comparing the estimated and ground truth cell type fractions in the validation dataset can be found in Fig. S12.
Note that for this evaluation we focus exclusively on the results from the built-in reference-based deconvolution, without the "reference update" step (Text S1). Across the validation datasets, the estimated fractions closely align with the ground truth fractions, with an average Pearson correlation of 0.92 per cell type. Again, we highlight that this performance is observed on bulk samples that are simulated without borrowing any known cellular information from the built-in reference and are designed to exhibit not merely proportional differences, but also appropriate sample-level heterogeneity [8]. Although these features represent typical challenges in real-world deconvolution tasks [8], the InstaPrism algorithm, combined with the built-in reference, achieves high performance, highlighting the robustness of the algorithm and the generalizability of the built-in references.
We observed three instances of notably lower performance (correlation < 0.5) in our validation datasets: the CRC-epi, LUAD-epithelial, and RCC-dc signatures. Two of these instances involve normal epithelial cells and one involves dendritic cells. Both cell types are typically present at low frequencies in bulk samples and are usually identifiable only at single-cell resolution. We note that the scRNA-seq datasets used for reference construction and bulk simulation may have applied different criteria for labeling these cell types, in which case the predicted signatures would not align with the established ground truths. Indeed, normal epithelial cells often have overlapping signatures with cancerous epithelial cells, and dendritic cells are frequently found to be mixed with the myeloid cells in low-dimensional representations such as UMAP [22,29]. We suggest that these instances of low performance do not necessarily indicate incorrect signatures in the built-in reference, but rather highlight the challenges in accurately representing these cell types in general. Given that these cell types are typically infrequent, exhibit minimal compositional differences, and do not affect the deconvolution results of other cell types, it may be advisable to exclude them from the reference in real practice.
Additionally, two other signatures, GBM-oligodendrocyte and RCC-Bcell, show moderate performance (correlation < 0.75). Note that this occurs in one validation dataset but not in others, which may again be due to inconsistent labeling in the lower-performing validation dataset. For these signatures, we recommend retaining them in the reference as they still serve as useful guides for understanding the compositional information of the bulk samples.

Validation using TCGA data
We also incorporated real bulk samples from the TCGA cohort data (Table S3) to further assess the performance of the built-in reference. Following our proposed evaluation pipeline, we downloaded TCGA gene expression data along with associated tumor purity estimates [5,6,7]. These tumor purity estimation methods, originally developed to measure the proportion of tumor cells in the samples [30], were designed to provide insight into the tumor microenvironment and have been shown to be important in genomic analyses, such as tumor mutation burden [31] and intratumor heterogeneity estimation [32]. In our validation, we used these estimates as proxies for ground truth malignant proportions.
In Fig. S11a, we summarized the performance of each built-in reference by calculating the correlation between the malignant estimates and various TCGA tumor purity estimates. As an extension, we also included other deconvolution methods including CIBERSORTx [33], EPIC [34], DeconRNASeq [35], and DeMiXT [36] in the comparison. The malignant fraction estimates from these methods were obtained from Revkov et al. [30]; in their work, they applied these deconvolution methods to TCGA samples and reported the malignant fraction estimates specifically.
Across various measures of TCGA tumor purity, the malignant fraction estimates from InstaPrism using the built-in reference generally reflect tumor purity estimates and consistently rank at the top compared to other methods. We note that since different purity estimation methods utilize different data modalities [7] and follow distinct underlying estimation criteria, the resulting purity estimates should be considered approximations rather than definitive ground truth measurements.
For the transcription-based purity method ESTIMATE [6], which uses the same data modality (gene expression) as deconvolution analyses, all deconvolution methods demonstrate better correlation than with other purity estimation methods, with InstaPrism achieving an average Pearson correlation of 0.79 across different tumor types. For the somatic copy-number alteration-based method ABSOLUTE [5], InstaPrism exhibits an average Pearson correlation of 0.6, surpassing other advanced methods. For other TCGA purity estimation methods such as CPE [7] and LUMP [7], InstaPrism with the built-in reference generally achieves moderate correlation. Less association was observed for IHC, an image-based purity estimation method, across all deconvolution methods; this is consistent with previous findings that gene expression-based estimates of malignancy typically correlate poorly with IHC-based purity measures [1]. Overall, the malignant fraction estimates using the built-in reference properly reflect TCGA tumor purity estimates, suggesting its suitability for tumor purity estimation in support of genomic analyses such as tumor mutation burden [31] and intratumor heterogeneity estimation [32].
We note that the TCGA-RCC dataset is a significant outlier, with the estimated malignant signature poorly associated with any TCGA purity estimates (cor < 0.2, Fig. S11b). We hypothesized that the purity estimates being compared do not reflect cancerous epithelial cells alone but also include normal epithelial cells misclassified as malignant. Indeed, when we compared the combined fraction estimates for both malignant and normal epithelial cells, the correlation showed a significant improvement. Therefore, we recommend exercising caution when using purity estimates from the TCGA-RCC dataset in future analyses.
Together, we have demonstrated good performance and generalizability of the InstaPrism built-in reference by validating it using both simulated and real bulk samples. The reference captures comprehensive insights into the cellular components of the underlying tumor microenvironment, offering a ready-to-use option for the bulk deconvolution task. To reproduce these results or to test with your own data, please refer to our source code at https://github.com/humengying0907/InstaPrismSourceCode.

1: validation scRNA-seq datasets from the same cancer type are used to generate simulated bulk samples, each containing 50 simulated bulk samples
2: bulk samples simulated from the validation dataset are associated with these sections
3: GBM_refPhi is constructed from a harmonized single-cell GB core reference [27], comprising 16 scRNA-seq studies across multiple platforms

sample
• Input: $A_{G \times S}$: gene expression reference matrix, with $\sum_g A_{g,s} = 1$ for all cell states $s \in [1, S]$
• Intermediate variable (for InstaPrism only): $B_{G \times S}$: probability of assigning a read in gene $g \in [1, G]$ to a cell state $s \in [1, S]$
• Gibbs sampling parameter (for BayesPrism only): $\alpha = 10^{-8}$, a non-informative Dirichlet prior
• Output: $\theta_{S \times 1}$: cell-state fractions
• Output (for InstaPrism only): $W_{G \times 1}$: scaling matrix to reconstruct $Z_{G \times S}$
• Output (for BayesPrism only): $Z_{G \times S}$: cell-state specific gene expression

Z=
g,s = X g for each g, θ s for each s

Figure S1: Fraction update trajectory. Line plot showing how cell-type fractions change over iterations in BayesPrism and InstaPrism for one bulk sample from the tutorial data [4] provided in BayesPrism.

Figure S2: Effect of reference update in validation datasets. Scatter plot comparing the deconvolution performance, as evaluated by Pearson correlation, between the scRNA-based reference (InstaPrism built-in reference) and the updated reference. a Using real bulk samples from the TCGA cohorts, each dot represents the Pearson correlation between the estimated malignant fractions and one kind of TCGA tumor purity estimate, which serves as a proxy for ground truth malignant fractions. b Using simulated samples with known cell type fractions, each dot represents the Pearson correlation between the estimated and ground truth cell type fractions. Different colors correspond to different cell types, with black representing the malignant cell type specifically. Higher correlations indicate better deconvolution performance.

Figure S4: Effect of duplicated reference. a When duplicated cell states are present in the reference (Oligodendrocyte1 and Oligodendrocyte2), fraction estimates for the duplicated cell states are identical. b Duplicated cell states each explain part of the ground truth Oligodendrocyte fraction. c Fraction estimates from the duplicated reference are additive, matching the ground truth fraction levels.

Figure S6: Effect of missing cell types in the reference. Heatmap showing the deconvolution performance using references with varying levels of missing cell types on simulated LUAD samples. The performance is evaluated by a Pearson correlation and b RMSE values between the estimated and ground truth fractions of the remaining cell types. Each row corresponds to a different reference configuration, ranging from no missing cell types to five missing cell types.

Figure S7: An example of rare cell type deconvolution results. A pie plot showing the average fraction estimates of three rare cell types from 689 TCGA-CRC samples, using the built-in CRC reference (CRC_refPhi). For visibility, the labeled segments in the pie plot are scaled up 10 times and are not proportional to the actual percentages. The interquartile range (IQR), indicating the 25-75% range of these fraction estimates, is labeled in the legend.

Figure S8: Convergence of log likelihood over iterations. Each curve represents a deconvolution process with a different number of cell types included in the reference, when running deconvolution on a simulated breast cancer dataset.

Figure S9: An example of convergence plot from the InstaPrism() function. Heatmap comparing the absolute difference in fraction estimates between the penultimate and final iterations for all input samples, as an indicator of model convergence. Scenario a displays a significant difference in fraction estimates due to insufficient iterations (n.iter = 20), indicating a lack of model convergence, whereas scenario b presents negligible differences in fraction estimates across samples with sufficient iterations (n.iter = 100), suggesting that the model no longer updates fraction estimates and has thus achieved convergence.

Figure S10: Performance of InstaPrism built-in reference on simulated data. Scatter plot showing the performance of each cell type reference against the relative abundance of the cell type in simulated datasets. The x-axis represents the average fraction of each cell type, and the y-axis corresponds to the Pearson correlation coefficient between the estimated and ground truth cell type proportions. Different colors correspond to different cell types from the reference, while different shapes denote different validation datasets used under the same reference. Cell types with lower performance (correlation < 0.75) are indicated with text labels.

Figure S11: Performance of InstaPrism built-in reference on TCGA purity estimation. a Boxplot comparing the performance of different deconvolution methods in estimating TCGA tumor purity, measured by Pearson correlation between the estimated malignant proportion and the known TCGA tumor purity quantities. Each dot represents a tumor type and each panel of the figure corresponds to a specific TCGA tumor purity estimation method. b Scatter plot of the correlation between tumor purity estimates and malignant proportion estimates versus the correlation between tumor purity estimates and malignant plus normal epithelial proportion estimates in TCGA RCC samples, using the built-in RCC reference (RCC_refPhi), with each dot corresponding to a different TCGA purity estimation method.

Figure S12: Detailed deconvolution results from validation datasets. Scatter plot comparing ground truth fraction (x axis) and fraction estimates (y axis) for matched cell types in each validation dataset, using the InstaPrism built-in reference.

Table S3: Summary of TCGA datasets used in reference evaluation
1: ABSOLUTE: Somatic copy-number data based purity estimation method [5]
2: CPE: Consensus purity estimation method integrating purity estimates from ABSOLUTE, ESTIMATE, LUMP, and IHC [7]
3: ESTIMATE: Gene expression based purity estimation method [6]
4: IHC: Image analysis of haematoxylin and eosin stain slides produced by the Nationwide Children's Hospital Biospecimen Core Resource [7]
5: LUMP: DNA methylation based purity estimation method [7]
6: both colon adenocarcinoma and rectum adenocarcinoma samples are collected from TCGA for CRC_refPhi validation